MEDB 5505, Module03
2025-02-01
Topics to be covered
- What you will learn
- Reading text files
- Comma delimited files
- Tab delimited files
- Other delimiters
- Fixed width files
- Real world examples
- Your programming assignment
Text files, 1
- Advantages
  - Easy import into many programs
  - Easy to review in a text editor such as Notepad
- Disadvantages
  - Bigger size
  - Slower to import
Text files, 2
- Wide range of formats
- First row for variable names
- Always look for a data dictionary
Should I download before reading?
- Read directly from website
  - Convenient
  - Updates incorporated at each run
- Download then read
  - Your copy remains even if the website removes the file
  - Avoid repeated long downloads
  - Work even when Internet connection is down
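One way to combine the two approaches is to download the file only when no local copy exists yet. The sketch below assumes the readr package is installed; the helper name `read_cached` is hypothetical and not part of any package.

```r
library(readr)

# Hypothetical helper: download the file only if there is no local copy,
# then read the local copy. Extra arguments are passed on to read_csv.
read_cached <- function(url, local_file, ...) {
  if (!file.exists(local_file)) {
    # mode = "wb" avoids line-ending changes on Windows
    download.file(url, destfile = local_file, mode = "wb")
  }
  read_csv(local_file, ...)
}
```

A call might look like `read_cached(some_url, "../data/some_file.csv", col_types = "nn")`, where both the URL and the file name are placeholders. Delete the local copy whenever you want to pick up updates from the website.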
No data dictionary?
- Peek at file
  - Check for the same number of delimiters on each line
  - Tabs versus multiple blanks are hard to distinguish
  - http://www.pmean.com/12/pesky.html
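The checks above can be sketched in base R. The small file written here is made up for illustration (its fourth line deliberately has an extra field); in practice you would point the code at your own file and use its suspected delimiter.

```r
# A made-up file to inspect.
tmp <- tempfile(fileext = ".txt")
writeLines(c("x,y", "1,4", "2,8", "3,12,99", "4,16"), tmp)

# Peek at the first few lines.
first_lines <- readLines(tmp, n = 5)
print(first_lines)

# Count the fields on each line; every line should have the same count.
n_fields <- count.fields(tmp, sep = ",")
print(table(n_fields))

# Lines whose field count differs from the header line.
bad_lines <- which(n_fields != n_fields[1])
print(bad_lines)

# Tabs and blanks look alike on screen, so test for tabs explicitly.
has_tab <- grepl("\t", first_lines, fixed = TRUE)
print(any(has_tab))
```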
No data dictionary?
- Experiment
- If needed, edit the file manually
  - Simple edits of one or two offending lines
  - Global search and replace
    - Change tabs to blanks
    - Change multiple blanks to single blank
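The two global replacements can also be done in R rather than in a text editor. This is a minimal sketch on a made-up file that mixes tabs and runs of blanks; it assumes readr is installed and writes the cleaned lines to a new file so the original stays untouched.

```r
library(readr)

# A made-up file that mixes tabs and multiple blanks.
raw_file <- tempfile(fileext = ".txt")
writeLines(c("x\ty", "1\t4", "2   8", "3 \t 12"), raw_file)

lines <- readLines(raw_file)
lines <- gsub("\t", " ", lines, fixed = TRUE)  # change tabs to blanks
lines <- gsub(" +", " ", lines)                # change multiple blanks to a single blank

# Save the edited lines under a new name, then read with a blank delimiter.
clean_file <- tempfile(fileext = ".txt")
writeLines(lines, clean_file)
clean_data <- read_delim(
  file = clean_file,
  delim = " ",
  col_names = TRUE,
  col_types = "nn")
```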
Troubleshooting
- Multiple variables read in as a single variable (often the wrong delimiter)
- Lots of missing values
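The first symptom is easy to reproduce. The sketch below, which assumes readr is installed and uses a made-up tilde delimited file, reads the same file with the wrong and then the right delimiter.

```r
library(readr)

# A made-up file that uses a tilde as the delimiter.
tmp <- tempfile(fileext = ".txt")
writeLines(c("x~y", "1~4", "2~8"), tmp)

# Wrong delimiter: read_csv looks for commas and finds none,
# so each line is read as a single character variable.
wrong <- read_csv(
  file = tmp,
  col_names = TRUE,
  col_types = cols(.default = col_character()))
print(ncol(wrong))

# Right delimiter: two numeric variables.
right <- read_delim(
  file = tmp,
  delim = "~",
  col_names = TRUE,
  col_types = "nn")
print(ncol(right))
```

Checking `ncol()` or `glimpse()` right after reading is a quick way to catch this before any analysis starts.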
Break #1
- What you have learned
- What’s coming next
An example of a comma delimited file
x,y
1,4
2,8
3,12
4,16
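The read_csv function
library(tidyverse)  # loads readr (read_csv) and dplyr (glimpse)
raw_data <- read_csv(
  file="../data/simple.csv",
  col_names=TRUE,   # first row holds the variable names
  col_types="nn")   # both columns are numeric
glimpse(raw_data)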
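The compact string `"nn"` is shorthand for one column type per column. A minimal sketch, assuming readr is installed and using a temporary copy of the example file, shows the longer `cols()` form that names each column explicitly.

```r
library(readr)

# Recreate the small example file in a temporary location.
tmp <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,4", "2,8", "3,12", "4,16"), tmp)

# Compact form: one letter per column, "n" for number.
compact <- read_csv(
  file = tmp,
  col_names = TRUE,
  col_types = "nn")

# Long form: the same types, spelled out by column name.
verbose <- read_csv(
  file = tmp,
  col_names = TRUE,
  col_types = cols(x = col_number(), y = col_number()))
```

The long form is more typing, but it fails loudly if a column is renamed, which can be helpful for files that change over time.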
Break #2
- What you have learned
- What’s coming next
Tab delimited files
x	y
1	4
2	8
3	12
4	16
Using the read_tsv function
raw_data <- read_tsv(
  file="../data/simple.tsv",
  col_names=TRUE,
  col_types="nn")
glimpse(raw_data)
Break #3
- What you have learned
- What’s coming next
Anything can be a delimiter
x~y
1~4
2~8
3~12
4~16
Using the read_delim function with delim="~"
raw_data <- read_delim(
  file="../data/tilde.txt",
  delim="~",
  col_names=TRUE,
  col_types="nn")
glimpse(raw_data)
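Because read_delim accepts any single delimiter, it can also read comma and tab delimited files; read_csv and read_tsv are convenience versions with the delimiter already set. The sketch below assumes readr is installed and uses two made-up temporary files.

```r
library(readr)

# The same made-up data stored with two different delimiters.
csv_file <- tempfile(fileext = ".csv")
tsv_file <- tempfile(fileext = ".tsv")
writeLines(c("x,y", "1,4", "2,8"), csv_file)
writeLines(c("x\ty", "1\t4", "2\t8"), tsv_file)

# read_delim with an explicit delimiter handles both files.
from_csv <- read_delim(csv_file, delim = ",", col_names = TRUE, col_types = "nn")
from_tsv <- read_delim(tsv_file, delim = "\t", col_names = TRUE, col_types = "nn")

# read_csv gives the same values as read_delim with delim = ",".
shortcut <- read_csv(csv_file, col_names = TRUE, col_types = "nn")
```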
Break #4
- What you have learned
- What’s coming next
The read_fwf function
raw_data <- read_fwf(
  file="../data/fixed.txt",
  # read_fwf has no col_names argument; name the columns in fwf_cols
  col_positions=fwf_cols(x=1, y=2),   # widths: x is 1 character, y is 2
  col_types="nn")
glimpse(raw_data)
Helpful functions with read_fwf
- fwf_empty()
  - Uses spacing to guess at column positions
- fwf_widths()
  - Specifies the width of each column
- fwf_positions()
  - Specifies start and end locations for each column
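The helpers are different ways of describing the same layout. As a sketch, assuming readr is installed and using a made-up file with x in columns 1-2 and y in columns 3-5, each call below should yield the same data.

```r
library(readr)

# A made-up fixed width file: x in columns 1-2, y in columns 3-5.
tmp <- tempfile(fileext = ".txt")
writeLines(c(" 1  4", " 2  8", " 3 12", " 4 16"), tmp)

# Describe the layout by column widths.
by_width <- read_fwf(
  file = tmp,
  col_positions = fwf_widths(c(2, 3), col_names = c("x", "y")),
  col_types = "nn")

# Describe the layout by start and end positions.
by_position <- read_fwf(
  file = tmp,
  col_positions = fwf_positions(
    start = c(1, 3),
    end = c(2, 5),
    col_names = c("x", "y")),
  col_types = "nn")

# fwf_cols takes name = c(start, end) pairs (or name = width).
by_cols <- read_fwf(
  file = tmp,
  col_positions = fwf_cols(x = c(1, 2), y = c(3, 5)),
  col_types = "nn")
```

Widths are convenient when columns are packed with no gaps; start and end positions are easier when the data dictionary lists column ranges or when some columns should be skipped.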
Break #5
- What you have learned
- What’s coming next
Function arguments for advanced options
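- col_select= (which columns to keep)
- na= (strings to treat as missing values)
- name_repair= (how to handle problematic column names)
- skip= (number of lines to skip before reading)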
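A short sketch of these arguments in use follows. It assumes readr 2.0 or later, and the file contents are made up: a title line that is not data, a column name containing a blank, and -99 as a missing value code.

```r
library(readr)

# A made-up file with a title line, an awkward name, and -99 for missing.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Hypothetical title line that is not data",
  "id,age,weight kg",
  "1,34,-99",
  "2,-99,80"), tmp)

# skip the title line, treat -99 as missing, repair the awkward name.
all_columns <- read_csv(
  file = tmp,
  skip = 1,
  na = c("", "NA", "-99"),
  name_repair = "universal",
  col_types = cols(.default = col_number()))
print(names(all_columns))

# Same file, keeping only two of the columns.
two_columns <- read_csv(
  file = tmp,
  skip = 1,
  na = c("", "NA", "-99"),
  col_select = c(id, age),
  col_types = cols(.default = col_number()))
print(names(two_columns))
```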
Example 1, binary.csv
Example 1, a brief description
Example 1, viewing the file in Notepad
Example 1, the code to peek at the data
url_binary <- "https://stats.idre.ucla.edu/stat/data/binary.csv"
read_lines(
  file=url_binary,
  n_max=10)
Example 1, the code to read the data
example_binary <- read_csv(
  file=url_binary,   # the URL defined when peeking at the data
  col_names=TRUE,
  col_types="nnnn")
glimpse(example_binary)
Example 2, barbershop-music.txt
Example 2, viewing the file in Notepad
Example 2, the code to peek at the data
url_barbershop <- "https://dasl.datadescription.com/download/data/3061"
read_lines(
  file=url_barbershop,
  n_max=10)
[1] "Singing\tPerformance\tMusic"
[2] "151\t143\t138"
[3] "152\t146\t136"
[4] "146\t143\t140"
[5] "146\t147\t142"
[6] "145\t141\t134"
[7] "144\t139\t140"
[8] "133\t138\t132"
[9] "129\t135\t128"
[10] "134\t125\t132"
Example 2, the code to read the data
raw_data <- read_tsv(
  file=url_barbershop,
  col_names=TRUE,
  col_types="nnn")
glimpse(raw_data)
Example 3, airport.txt
Example 3, peeking at the file on the web
Example 3, a description of the data
- Here is an excerpt from the data dictionary.
VARIABLE DESCRIPTIONS:
Airport                           Columns  1-21
City                              Columns 22-43
Scheduled departures              Columns 44-49
Performed departures              Columns 51-56
Enplaned passengers               Columns 58-65
Enplaned revenue tons of freight  Columns 67-75
Enplaned revenue tons of mail     Columns 77-85
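Before reading a fixed width file, it can help to check the column ranges copied from the data dictionary. This base R sketch uses the ranges from the excerpt above; the names `widths` and `gaps` are only for illustration.

```r
# Start and end columns copied from the data dictionary excerpt.
start_column <- c( 1, 22, 44, 51, 58, 67, 77)
end_column   <- c(21, 43, 49, 56, 65, 75, 85)
n <- length(start_column)

# Width of each field.
widths <- end_column - start_column + 1
print(widths)

# Unused columns between one field and the next
# (a negative value would mean two fields overlap).
gaps <- start_column[-1] - end_column[-n] - 1
print(gaps)

# Stop early if the ranges are inconsistent.
stopifnot(
  length(end_column) == n,
  all(widths > 0),
  all(gaps >= 0))
```

A typing mistake in one of the ranges usually shows up here as a zero or negative width, or as a negative gap.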
Example 3, the code to peek at the data
url_airport <- "http://jse.amstat.org/datasets/airport.dat.txt"
read_lines(
  file=url_airport,
  n_max=10)
Example 3, Defining variable names and column locations
start_column <- c( 1, 22, 44, 51, 58, 67, 77)
end_column <- c(21, 43, 49, 56, 65, 75, 85)
variable_names <- c(
  "airport",
  "city",
  "scheduled_departures",
  "performed_departures",
  "enplaned_passengers",
  "enplaned_freight",
  "enplaned_mail")
Example 3, the code to read the data
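example_3 <- read_fwf(
  file=url_airport,   # the URL defined when peeking at the data
  col_positions=fwf_positions(
    start=start_column,
    end=end_column,
    col_names=variable_names),   # names belong in fwf_positions, not read_fwf
  col_types="ccnnnnn")
glimpse(example_3)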
Break #6
- What you have learned
- What’s coming next
- Your programming assignment
This programming assignment was written by Steve Simon on 2024-12-18 and is placed in the public domain.
Program
- Download the xx program
  - Store it in your src folder
- Modify the file names
  - Use your last name instead of “simon”
- Modify the documentation headers
  - Add your name
  - Optional: change the copyright statement
Your submission
- Save the output in HTML format
- Convert it to PDF format
- Make sure that the PDF file includes
  - Your last name
  - The number of this course
  - The number of this module
- Upload the file
Summary
- What you have learned
- Reading text files
- Comma delimited files
- Tab delimited files
- Other delimiters
- Fixed width files
- Real world examples
- Your programming assignment